Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RzShell: refactor string, regex and byte search #4762

Closed
wants to merge 194 commits into from

Conversation

Rot127
Copy link
Member

@Rot127 Rot127 commented Dec 10, 2024

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

Changes made

  • Moves all legacy search commands to RzShell (only commands, inside they still do their string parsing on arguments).
  • Refactor string and byte search
  • Move to RzShell
  • Moves: / to /z.
  • Add support for Unicode and EBCDIC string search.
  • Add support for (Unicode) regex string search.
  • Add support for byte string regex search /xr.
  • Add more details to the search help messages.
  • Offsets of the search hits align with the actual encoding. Not with the UTF-8 encoding.
  • Dispatches memory chunks for search into threads.
  • Changes to ps
    • Adds extra arguments to specify encoding (also EBCDIC).
    • Add additional delimiter argument (stop at first non-printable).
    • Document it more.
    • Add psu alias for ps utf8
  • Changes to Settings
    • Remove str.search.max_uni_blocks - Effectively a metric the user should not know about; adds too much complexity.
    • str.search.max_threads -> search.max_threads - This is a general setting for the search now.
    • str.search.raw_alignment -> search.str.raw_alignment - Unify settings (only used for RzBin search.).
    • str.search.encoding -> str.encoding - Valid for all string interpretations.
    • str.search.min_length -> search.str.min_length - Unify settings.
    • str.search.buffer_size -> search.str.max_length - Unify settings.
    • str.search.max_region_size -> search.str.max_region_size - Unify settings.
    • str.search.check_ascii_freq -> search.str.check_ascii_freq - Unify settings.
  • Removed commands
    • /! - Because the command modifiers are not properly handled in RzShell yet and the advantage of this one is dubious (IMHO).
    • /f - Modifier and obsolete, because search is dispatched into threads.
    • /b - Modifier and obsolete, because search is dispatched into threads.
    • /+ - Because no idea what it does. Seems not particular useful.
    • /e - Replaced with regex search in bytes and string search.
    • /w - All Unicode is searched now properly with /z.
  • Make some changes to the string escaping, so it works reliably with Unicode characters.
    • The RzStrEscOptions were inconsistently used.
      E.g. show_asciidot (replace non-printable ascii with dot) was ignored for \n, \t etc.
    • Defined Unicode code points are escaped now with \U00hhhhhh. All other non-printable bytes are escaped with \xhh. There are still some exceptions (when legacy escape functions are used) but most places are ok now.
  • General
    • Fix inconsistencies in unicode decoders/encoders and checkers. They now either return 0 on an invalid decode or the number of bytes the code point requires.
    • Add many unit tests for Unicode related logic.
    • Update Unicode tables to Version 16.
    • Add helper to check code points.
    • Escaped strings now escape valid Unicode code points to /Uhhhhhh (if not requested otherwise by the user) and invalid code points to /xhh.
    • Add helper functions for hexadecimal strings and bits.

TODO Overview

What happens here:
Slowly copying changes from https://github.com/Rot127/rizin/tree/rz-search-reference (which is #4742 with some comments already addressed).

Will resend to fuzz-dist when all the tests pass here.

Stuff to do (without any order)

  • Move essential changes from https://github.com/Rot127/rizin/tree/rz-search-reference / Add / (search) commands to newshell + rewrite of rz_search #4742 before starting with commands
  • Fix commands
    • Bytes search
      • Implement
      • Fix suggestions
      • Use RzBuffer
      • Document usage in book (normal, wildcards, different maps, searching with debug + don't forget to run until mapped, API).
    • String search
      • Implement
      • Fix suggestions
        • Encoding names in command arguments.
      • Name hits based on encoding.
      • Testing
        • Case insensitivity on non latin alphabets.
        • Encoding + endianess
          • RZ_STRING_ENC_IBM290
          • RZ_STRING_ENC_EBCDIC_UK
          • RZ_STRING_ENC_EBCDIC_US
          • RZ_STRING_ENC_EBCDIC_ES
          • Fix: Offset into memory, not into the utf8 string.
        • diacritics
        • Search for strings less then 4 bytes.
        • Search within buffers which have no \0 in it (non-C strings).
        • Regex search
      • Document in book (basic usage, API, usage of distance, behavior of edge cases of distance: utf8 and other encodings, diacritics), case (in)sensitive.
  • Update book (with example API implementation).
  • Implement general search IO API which doesn't require knowledge of search spaces (what rz_search_run() was before).
  • Show progress of search.
  • Use RzBuffer for byte space searches
  • search.align shoudl default to search.alignment=asm.cpu_bits/8.
  • (Optional) Allow to cancel search with ctrl + c
  • (Optional) Extend RzSearchHit to hold more complex data.
  • (Optional) Add option to define search/divide strategy.
  • (Optional) Dispatch views into IO as chunks to the find() callback.

Test plan

...

Closing issues

closes #4910

@notxvilka

This comment was marked as resolved.

wargio

This comment was marked as resolved.

Most calls to process_one_string() never decode a valid string.
Before it allocated a complete buffer on the heap nontheless.
Even if it freed it after one iteration.

This is prevented now, by first decoding onto the stack and then
continues on the heap when the string is reasonably long.
@wargio wargio requested a review from notxvilka February 20, 2025 14:11
@Rot127
Copy link
Member Author

Rot127 commented Feb 20, 2025

Going to squash commits into reasonable parts and push to dist-fuzz.

@notxvilka
Copy link
Contributor

Going to squash commits into reasonable parts and push to dist-fuzz.

@Rot127 please open a new PR with that.

Copy link
Contributor

@notxvilka notxvilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An immense amount of work and way better test coverage. While there are still things that could be improved, I believe they could go in separate PRs to not block this PR anymore.
Thus, LGTM. Let's merge it once green and has better history! Kudos!

@@ -99,7 +99,7 @@ jobs:
os: ubuntu-22.04
build_system: meson
compiler: gcc-12
cflags: "-DASAN=1 -DRZ_ASSERT_STDOUT=1 -ftrivial-auto-var-init=pattern -funsigned-char"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open an issue about it also

@@ -31,6 +33,7 @@ typedef enum {
RZ_STRING_ENC_EBCDIC_US = 's',
RZ_STRING_ENC_EBCDIC_ES = 't',
RZ_STRING_ENC_GUESS = 'g',
RZ_STRING_ENC_SETTINGS = 'S', ///< Use str.encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wargio you never answered this one. It's okay for now, but would be nice to separate.

#include <rz_list.h>
#include <rz_th.h>

#define RZ_SEARCH_AES_LENGTH 40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be in a separate file, I think, but is fine for this PR, could be done afterwards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will be. With the AES search refactored. @wargio already moved it into a separted file before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep this internal. no real need to move this, and since i want to refactor more that type of search with new features, i think it can be ok to be there for now.

@Rot127
Copy link
Member Author

Rot127 commented Feb 20, 2025

Close for PR with cleaned up history.

Superseded by #4919

@Rot127 Rot127 closed this Feb 20, 2025
@Rot127 Rot127 deleted the rz-search branch February 22, 2025 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rz-bin heap-buffer-overflow
6 participants